
    SMAuC -- The Scientific Multi-Authorship Corpus

    With an ever-growing number of new publications each day, scientific writing poses an interesting domain for authorship analysis of both single-author and multi-author documents. Unfortunately, most existing corpora lack either material from the science domain or the required metadata. Hence, we present SMAuC, a new metadata-rich corpus designed specifically for authorship analysis in scientific writing. With more than three million publications from various scientific disciplines, SMAuC is the largest openly available corpus for authorship analysis to date. It combines a wide and diverse range of scientific texts from the humanities and natural sciences with rich and curated metadata, including unique and carefully disambiguated author IDs. We hope SMAuC will contribute significantly to advancing the field of authorship analysis in the science domain.
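    A corpus of this kind is typically consumed by grouping texts under the disambiguated author IDs, for example to construct same-author and different-author text pairs for verification experiments. A minimal sketch of that grouping step, assuming hypothetical record fields ("author_id", "text") rather than SMAuC's actual schema:

        from collections import defaultdict
        from itertools import combinations
        import random

        # Hypothetical record layout; the field names are assumptions,
        # not SMAuC's actual schema.
        corpus = [
            {"author_id": "A-0001", "text": "First paper by author 1 ..."},
            {"author_id": "A-0001", "text": "Second paper by author 1 ..."},
            {"author_id": "A-0042", "text": "A paper by author 42 ..."},
        ]

        # Group texts by disambiguated author ID.
        by_author = defaultdict(list)
        for record in corpus:
            by_author[record["author_id"]].append(record["text"])

        # Same-author pairs: all combinations of one author's texts.
        positive = [pair for texts in by_author.values()
                    for pair in combinations(texts, 2)]

        # Different-author pairs: one text from each of two distinct authors.
        negative = [(random.choice(by_author[a]), random.choice(by_author[b]))
                    for a, b in combinations(by_author, 2)]

        print(len(positive), "same-author pairs,", len(negative), "different-author pairs")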

    Evaluating Generative Ad Hoc Information Retrieval

    Recent advances in large language models have enabled the development of viable generative information retrieval systems. A generative retrieval system returns a grounded generated text in response to an information need instead of the traditional document ranking. Quantifying the utility of these types of responses is essential for evaluating generative retrieval systems. As the established evaluation methodology for ranking-based ad hoc retrieval may seem unsuitable for generative retrieval, new approaches for reliable, repeatable, and reproducible experimentation are required. In this paper, we survey the relevant information retrieval and natural language processing literature, identify search tasks and system architectures in generative retrieval, develop a corresponding user model, and study its operationalization. This theoretical analysis provides a foundation and new insights for the evaluation of generative ad hoc retrieval systems.

    Comment: 14 pages, 5 figures, 1 table
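    To illustrate why utility needs rethinking here: a user reads a generated response statement by statement rather than scanning a ranked list, so one conceivable operationalization discounts per-statement gains by reading position, analogous to discounted cumulative gain over a ranking. A toy sketch of that idea, not the user model developed in the paper:

        def response_utility(statement_gains, patience=0.8):
            """Toy utility for a generated response read statement by statement.

            Each statement carries a graded relevance/grounding gain in [0, 1];
            a geometric patience factor models the chance the reader continues.
            Illustrative only -- not the paper's actual user model.
            """
            return sum(gain * patience ** i for i, gain in enumerate(statement_gains))

        # A response with its useful, well-grounded content up front scores
        # higher than one burying it at the end.
        print(response_utility([1.0, 0.8, 0.2]))  # ~1.77
        print(response_utility([0.2, 0.8, 1.0]))  # ~1.48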

    Overview of the Authorship Verification Task at PAN 2022

    The authorship verification task at PAN 2022 follows the experimental setup of similar shared tasks in the recent past. However, it focuses on a different and very challenging scenario: given two texts belonging to different discourse types, the task is to determine whether they are written by the same author. Based on a new corpus in English, we provide pairs of texts covering four discourse types: essays, emails, text messages, and business memos. The differences in communicative purpose, intended audience, and level of formality render the cross-discourse-type authorship verification task very hard. We received 7 submissions and evaluated them, along with two baseline approaches, using the TIRA integrated research architecture. This paper reviews the submissions and presents a detailed discussion of the evaluation results.
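    Baselines in PAN's verification tasks are deliberately simple; a classic example is cosine similarity between character n-gram TF-IDF profiles of the two texts, thresholded into a same-author decision. A minimal sketch of such a baseline (the threshold and n-gram range are arbitrary choices here, not the task's official baseline code):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def verify(text1, text2, threshold=0.5):
            """Cosine-similarity baseline for a single verification case."""
            # Character n-grams are a common, largely topic-agnostic style signal.
            vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
            X = vectorizer.fit_transform([text1, text2])
            score = cosine_similarity(X[0], X[1])[0, 0]  # similarity in [0, 1]
            return score, score >= threshold

        score, same_author = verify("An essay-style sample text about style ...",
                                    "thx, will send u the memo tmrw")
        print(f"similarity={score:.3f}, same_author={same_author}")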

    Overview of PAN 2020: Authorship Verification, Celebrity Profiling, Profiling Fake News Spreaders on Twitter, and Style Change Detection

    We briefly report on the four shared tasks organized as part of the PAN 2020 evaluation lab on digital text forensics and authorship analysis. Each task is introduced and motivated, and the results obtained are presented. Altogether, the four tasks attracted 230 registrations, yielding 83 successful submissions. This, and the fact that we continue to invite the submission of software rather than its run output via the TIRA experimentation platform, marks a good start to the second decade of PAN evaluation labs.

    We thank Symanto for sponsoring the ex aequo award for the two best-performing systems at this year's author profiling shared task on profiling fake news spreaders on Twitter. The work of Paolo Rosso was partially funded by the Spanish MICINN under the research project MISMIS-FAKEnHATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31). The work of Anastasia Giachanou is supported by the SNSF Early Postdoc Mobility grant under the project Early Fake News Detection on Social Media, Switzerland (P2TIP2_181441).

    Bevendorff, J.; Ghanem, BHH.; Giachanou, A.; Kestemont, M.; Manjavacas, E.; Markov, I.; Mayerl, M.... (2020). Overview of PAN 2020: Authorship Verification, Celebrity Profiling, Profiling Fake News Spreaders on Twitter, and Style Change Detection. Springer. 372-383. https://doi.org/10.1007/978-3-030-58219-7_25

    Overview of PAN 2021: Authorship Verification, Profiling Hate Speech Spreaders on Twitter, and Style Change Detection

    The paper gives a brief overview of the three shared tasks to be organized at the PAN 2021 lab on digital text forensics and stylometry, hosted at the CLEF conference. The tasks include authorship verification across domains, author profiling for hate speech spreaders, and style change detection for multi-author documents. The tasks are in part new and in part continue and advance past shared tasks, with the overall goal of advancing the state of the art and providing for an objective evaluation on newly developed benchmark datasets.

    The work of the researchers from Universitat Politècnica de València was partially funded by the Spanish MICINN under the project MISMIS-FAKEnHATE on MISinformation and MIScommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31), and by the Generalitat Valenciana under the project DeepPattern (PROMETEO/2019/121).

    Bevendorff, J.; Chulvi-Ferriols, MA.; Peña-Sarracén, GLDL.; Kestemont, M.; Manjavacas, E.; Markov, I.; Mayerl, M.... (2021). Overview of PAN 2021: Authorship Verification, Profiling Hate Speech Spreaders on Twitter, and Style Change Detection. Springer. 567-573. https://doi.org/10.1007/978-3-030-72240-1_66

    PAN20 Authorship Analysis: Authorship Verification

    Task

    Authorship verification is the task of deciding whether two texts have been written by the same author, based on comparing the texts' writing styles. In the coming three years, at PAN 2020 to PAN 2022, we develop a new experimental setup that addresses three key questions in authorship verification that have not been studied at scale to date:

    Year 1 (PAN 2020): Closed-set verification. Given a large training dataset comprising known authors who have written about a given set of topics, the test dataset contains verification cases from a subset of the authors and topics found in the training data.

    Year 2 (PAN 2021): Open-set verification. Given the training dataset of Year 1, the test dataset contains verification cases from previously unseen authors and topics.

    Year 3 (PAN 2022): Surprise task. The task of the last year of this evaluation cycle (to be announced at a later time) will be designed with an eye on realism and practical application.

    This evaluation cycle on authorship verification provides for a renewed challenge of increasing difficulty within a large-scale evaluation. We invite you to plan ahead and participate in all three of these tasks. More information at: PAN @ CLEF 2020 - Authorship Verification

    Citing the Dataset

    If you use this dataset for your research, please be sure to cite the following paper: Sebastian Bischoff, Niklas Deckers, Marcel Schliebs, Ben Thies, Matthias Hagen, Efstathios Stamatatos, Benno Stein, and Martin Potthast. The Importance of Suppressing Domain Style in Authorship Analysis. CoRR, abs/2005.14714, May 2020.

    Bibtex:

        @Article{stein:2020k,
          author  = {Sebastian Bischoff and Niklas Deckers and Marcel Schliebs and Ben Thies and Matthias Hagen and Efstathios Stamatatos and Benno Stein and Martin Potthast},
          journal = {CoRR},
          month   = may,
          title   = {{The Importance of Suppressing Domain Style in Authorship Analysis}},
          url     = {https://arxiv.org/abs/2005.14714},
          volume  = {abs/2005.14714},
          year    = 2020
        }
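    The closed- vs. open-set distinction comes down to whether the authors (and topics) of the test cases also occur in training. A minimal sketch of how such an author split might be drawn; the split logic is assumed for illustration, not taken from the task's actual dataset construction:

        import random

        def split_authors(author_ids, unseen_fraction=0.2, seed=0):
            """Hold out a fraction of authors entirely, as in open-set evaluation."""
            rng = random.Random(seed)
            authors = list(author_ids)
            rng.shuffle(authors)
            cut = int(len(authors) * unseen_fraction)
            return set(authors[cut:]), set(authors[:cut])  # (seen, unseen)

        seen, unseen = split_authors(f"A{i:03d}" for i in range(100))
        # Closed-set (PAN 2020) test cases draw only on `seen` authors, who also
        # appear in training; open-set (PAN 2021) cases draw on `unseen` authors.
        print(len(seen), "seen authors;", len(unseen), "held-out authors")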